Norm-referenced and criterion-referenced testing within a program
evaluation context are compared, and a model for
developing/validating criterion-referenced tests is introduced.
(Author/GC)






Construction and Use of Criterion-Referenced
Tests in Program Evaluation Studies

Janice A. Gifford and Ronald K. Hambleton
University of Massachusetts, Amherst






Abstract 



The number of new educational programs has increased dramatically
since the mid-sixties. While some programs are minor extensions of
older programs, others represent completely new educational approaches.
The importance of comprehensive summative and formative evaluations of
these new programs is clear. It is equally clear to many administrators,
program developers, and evaluators that criterion-referenced tests are
an essential type of instrumentation for conducting program evaluation
studies. Unfortunately, nearly all of the recently developed
criterion-referenced testing technology applies to test development and
uses with individual scores (for example, to monitor student progress, to
diagnose student learning needs, and to certify students as high school
graduates). In program evaluation, group information is of central
importance. It is not the case, as some have assumed, that testing
technology developed for use of test scores with individuals is optimal
for this purpose. We suggest that there is some misdirection in testing
projects due to this basic misunderstanding. Four steps in test
development are different: (1) approach to item selection, (2) assessment
of reliability, (3) standard-setting methods, and (4) methods of test
score reporting. The purposes of the paper will be to consider the first
two steps and offer methods for handling them in preparing and using
criterion-referenced tests in program evaluation studies. In addition,
prior to considering the two steps, a brief comparison of norm-referenced
testing and criterion-referenced testing within the context of program
evaluation is offered and a model for developing and validating
criterion-referenced tests is introduced.

The following questions are often addressed in order to determine
the effectiveness and hence the impact of an educational program:

• Are the objectives worthwhile?

• Are the stated objectives being achieved?

• How does one program compare with another in accomplishing
  a common set of objectives?

• What changes should be made to improve program effectiveness?

The formal, systematic search for answers to these and similar
questions is termed program evaluation. During the past ten years, several
models of evaluation have emerged (Glass & Ellett, 1980). Since there is
no single accepted definition of program evaluation, these models differ in
varying degrees, in the set of questions addressed, and in the phases
of the program implementation that are examined. Rather than attempting to
present, compare, and contrast the major evaluation models here, we refer
the reader to any of the several excellent discussions of
evaluation models (Glass & Ellett, 1980; Perloff, Perloff, & Sussna,
1976; Popham, 1975; Worthen & Sanders, 1973). However, in its most
general form, program evaluation may be thought of in terms of three
phases. Phase one consists of the examination and evaluation of the
goals of a program. That is, "Are the stated purposes of the program
of value?" Phase two consists of the examination and evaluation of
the processes of the program. For example, "Are the processes such
that they facilitate the attainment of the stated program goals?"
Finally, phase three focuses on the measurement of program outcomes.
That is, "Have the stated goals and objectives been achieved?"

In order to answer questions raised at any of the three phases,
a program evaluator must begin by drawing on many of the measurement
techniques commonly used by social, psychological, and educational
researchers. For example, in order to study the adequacy of the goals
of a program, measurement may take the form of needs assessments,
attitude scales, or preference scales. In phase two, for examination
of the process, questionnaires, interview schedules, and observational
instruments may be helpful. Attitude scales, performance tasks, and
paper and pencil achievement measures are examples of techniques
available for the measurement of program outcomes.

Since educational programs, in particular, are generally directed
toward goals such as the acquisition of particular knowledge or skills,
or the advancement to some desired performance level or achievement
level by those individuals served by a program, it is crucial that an
evaluator employ a performance or achievement instrument sensitive
enough to adequately reflect the ability of a group or of the
individuals in terms of the specific goals of the program. Unfortunately,
norm-referenced paper and pencil instrument development techniques are
less than ideal for constructing tests to measure individual and group
accomplishments in relation to a set of program goals. Norm-referenced
test development methods, which are well known, are aimed toward producing
tests to reliably and validly rank or compare examinees. However,
evaluators require test development methods that will permit them to
design and to use instrumentation to determine what individuals and
groups can and cannot do in relation to a set of program goals.
Criterion-referenced test development methods provide the answer, since
criterion-referenced tests are constructed to permit the interpretation of
individual or group test scores in relation to a set of well-defined
objectives (Popham, 1978a).

Norm-referenced tests and criterion-referenced tests are designed
to achieve different purposes, and therefore the approaches to test
construction and test score interpretation will also differ. When
these two types of tests are used incorrectly, problems arise. For
example, Carver (1973) argues convincingly that Coleman et al. (1966),
in a well-known and often cited study of the impact of schooling, used
inappropriate instruments (norm-referenced tests rather than
criterion-referenced tests) and therefore the data do not address the
important question under study, that of the relationship between school
differences and level of achievement.
Following ten years or so of psychometric research, a well-developed
technology for building criterion-referenced tests and
using the derived test scores exists (e.g., Hambleton & Eignor, 1979a;
Popham, 1978a). Unfortunately, this technology is designed to construct
tests for use in evaluating the performance of individuals
in relation to a set of well-defined goals or competency statements,
and therefore when group performance is of primary interest, as it
is in program evaluation studies, variations from the usual ways of
building and using the tests will be necessary. Four steps in test
development are different: (1) approach to item selection, (2) assessment
of score reliability, (3) standard-setting methods, and (4) methods
of test score reporting. The purposes of this paper will be to consider
the first two steps and offer methods for handling them in preparing
and using criterion-referenced tests in program evaluation studies.
In addition, prior to considering the two steps, a brief comparison
of norm-referenced testing and criterion-referenced testing within the
context of program evaluation is offered and a model for developing
and validating criterion-referenced tests is introduced.
Comparison of Criterion-Referenced Tests
and Norm-Referenced Tests in the Context
of Program Evaluation

The educational program evaluator, in the search for a suitable
instrument, will quickly discover that the great majority of instruments
are norm-referenced tests. For example, more than 95% of the instruments
listed in the Eighth Mental Measurements Yearbook (Buros, 1978)
are norm-referenced tests. Although criterion-referenced measures
have not been used to the same extent as norm-referenced measures,
there is a growing awareness of the importance of
criterion-referenced measurement.

Generally, a norm-referenced test cannot be distinguished from
a criterion-referenced test by appearance alone. The differences
revolve primarily around three areas: specification of test content,
the selection of items, and interpretations of the scores. In comparing
criterion-referenced tests (CRTs) to norm-referenced tests (NRTs), it
should be kept in mind that the goal of NRTs
is to represent "ability" in terms of other individuals, while the
goal of CRTs is to represent "ability" in terms of a given domain of
content.

The first step in the construction of any test is to specify,
in some manner, the content domain to be measured by a test. It is
common for developers of both types of tests to begin with objectives. With
criterion-referenced tests, however, it is essential to describe the objectives
in considerably more detail. Added clarity can be obtained by offering
a sample test item, describing appropriate item content, and
specifying characteristics and types of answers that can be used as
distractors in objective test items. "Expanded objectives" (or
"domain specifications") facilitate the preparation of test items to
measure objectives and improve the clarity of test score interpretations.

The second phase of test construction involves the development,
analysis, and selection of items. With norm-referenced tests, a
large set of items is initially constructed to reflect the objectives
outlined in step one. Preliminary forms of the test are constructed
and administered to examinees similar to those for whom the test is
intended. Later, the items are studied in terms of their difficulty
and discrimination. Since the major purpose of a norm-referenced test
is to compare an individual's performance, knowledge, or skill to
that of some reference group, a suitable norm-referenced test will be
constructed with those items that contribute most to maximizing test
score variability. Comparisons among examinees are more reliable when
test scores are dispersed widely. Hence, the final item selection is
dependent not only on the objectives of interest, but also on the
statistical characteristics of the available items.

On the other hand, since the universe or domain of items is
specifically defined for a criterion-referenced test, item selection typically
consists of selecting a set of representative items from the domain.
If more than one objective is measured by a test, a set of representative
items from the domain of items matched to each objective is drawn.
Item statistics play a secondary role to item representativeness in
criterion-referenced test item selection.
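
As an illustration only, the drawing of representative items can be
sketched in Python as simple random sampling, without replacement,
within the item domain of each objective. The function name, the item
labels, and the choice of five items per objective are assumptions made
for this sketch and are not part of the test development methods
described here.

```python
import random


def draw_representative_items(domains, items_per_objective, seed=None):
    """Draw a simple random sample of items for each objective.

    domains: dict mapping an objective label to the list of items
             in that objective's item domain (hypothetical labels).
    Returns a dict mapping each objective to its sampled items.
    """
    rng = random.Random(seed)
    selected = {}
    for objective, item_pool in domains.items():
        if items_per_objective > len(item_pool):
            raise ValueError(f"domain for {objective} has too few items")
        # Sampling without replacement: no item appears twice on the test.
        selected[objective] = rng.sample(item_pool, items_per_objective)
    return selected


# Hypothetical item domains: two objectives with 40 and 25 items.
domains = {
    "objective_1": [f"obj1_item_{i}" for i in range(1, 41)],
    "objective_2": [f"obj2_item_{i}" for i in range(1, 26)],
}
test_items = draw_representative_items(domains, items_per_objective=5, seed=1)
for objective, items in test_items.items():
    print(objective, items)
```

Item statistics play no part in this sketch; a stratified draw (for
example, by content category) would be a natural variation when simple
random sampling is unlikely to represent the domain.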

Finally, test scores are reported and used in a way consistent
with the test's purpose. A norm-referenced test score is reported as a
raw score and one or more derived scores (for example, percentile scores, age
or grade-equivalent scores, and standard scores). Raw scores alone
have very little meaning. Inferences cannot be made as to what the
individual knows or does not know. The derived scores give specific
information concerning the relation of an individual's knowledge,
skill, or ability to that of a particular reference group. The score
(or scores) on a criterion-referenced test, however, provides information
concerning the relationship of an individual's knowledge, skill,
or ability to a given specified domain of content.

The intrinsic differences between criterion-referenced and
norm-referenced measurement have important implications for their use in the
evaluation of educational programs. A major shortcoming of the use of
norm-referenced tests in program evaluation results from the discrepancy
between the content covered by a test and the content of the
program that is being evaluated. The tests that are most commonly
used in evaluations are used nationwide and are based on an amalgamation
of objectives of programs from all over the country. Each
program has different instructional objectives and the instruction of
particular objectives may occur at different times. The overlap of
instructional objectives and test objectives will not usually be complete and
the degree of overlap will change from program to program. This is
particularly true in compensatory educational programs, where the
objectives may be more basic and specific than the general objectives
reflected in norm-referenced tests. Moreover, each curriculum
typically depends on the people teaching the program and their priorities
and emphases. In general, it will be difficult to find a standardized
achievement test where the content closely matches the content goals of
a particular program being evaluated. It is not uncommon, therefore, to
hear the charge of "unfairness" when a norm-referenced test is used
in program evaluation.

A second source of the discrepancy between test content and program
objectives arises directly from a major purpose of norm-referenced tests,
i.e., to compare an individual's performance, knowledge, or skill to
that of some reference group. In order to effectively obtain this
type of information from a test, the test must be constructed with
that purpose in mind. Consequently, norm-referenced tests consist of
test items that contribute most to maximizing test score variability.
In the process of choosing items that contribute sufficiently to test
variability, those contributing less to variability are eliminated.
It is clear that items tapping concepts taught successfully by a
great number of teachers will contribute little to test score
variability (most students will answer the items correctly) and will be
eliminated, while the items measuring pure reasoning ability will have
greater variability and will be retained. In other words, many
instruction-related skills are systematically eliminated, and the
variation that remains is primarily due to the effects of
non-instruction-related variables. When "easy" and "difficult" items are
deleted, resulting tests look less like achievement tests and more like
aptitude tests (Popham, 1978b). If an instrument is to be sensitive to the
learning process, its content must be carefully matched to that of the
program. Since, at present, many programs to be evaluated are innovative,
not only are the instructional methods different, but often the
goals and objectives of these programs are different from those of the
traditional program. As a result, a norm-referenced test score may
be inappropriate since it does not indicate knowledge in terms of the
instruction. It would often be a mistake to judge a new program
according to the standards of a traditional program.

Criterion-referenced tests, however, are constructed or can be
selected specifically to match the goals and objectives of a program,
and since item quality depends exclusively on the ability of the item
to reflect the domain, this match is not lost in the item selection
process as it may be in a norm-referenced test. Consequently,
criterion-referenced test scores, assuming the test from which the scores are
derived is constructed and administered properly, are valid indicators
of performance or achievement in relation to the instructional
objectives of the program.

Perhaps the greatest advantage of using criterion-referenced
measurement in the evaluation of educational programs results from
the range and quality of information obtainable from the test scores.
Because of the match between the test content and instructional
objectives, criterion-referenced scores permit a description of an
individual in terms of clearly specified domains of content. For
example, it may be said that a student has mastered 60% of a set of
program objectives. However, it is not always the case that information
is required on each individual or all objectives. Particularly
in program evaluation, an evaluator often will want to know how some
group of students in general has been affected by an educational
program rather than any given individual. Since this is the case,
it is possible with criterion-referenced testing to make very efficient
use of items. A procedure referred to as "item-examinee sampling"
provides for optimal efficiency in information gathering when there
are practical limits on the number of items that can be reasonably
administered to an individual. This topic will be considered in
detail in a later section.

In most evaluations of educational programs it is not only
important to know something about the achievement of those served by
a program in terms of the prescribed objectives; it is also valuable
to be able to compare the performance of individuals in a program to
the performance of various other groups. Even though criterion-referenced
tests are not constructed specifically to maximize
variability of test scores, and the frequency distribution of the test
scores may be homogeneous and hence less useful for ranking
individuals, norm-referenced interpretations of criterion-referenced
test scores can be made and can be of considerable value. As long
as objectives are held in common, comparisons of criterion-referenced
test scores among examinees or groups of examinees can be made.

Articles by Ebel (1978), Popham (1978b), and Mehrens and Ebel
(1979) provide additional insights into the topics considered in this
section.

Steps in Criterion-Referenced Test Development

In this section the essential steps in criterion-referenced test
development are introduced. A 12-step model is presented in Figure 1
(Hambleton & Eignor, 1979b). The importance of each step in the model
depends upon the size and scope of the test development and validation
project. An agency with the responsibility of producing tests for
state-wide use will proceed through the steps in a rather different way
than will a small consulting firm or a group of researchers.



In brief, the twelve steps are as follows:

Step 1 -- Objectives must be prepared or selected before the
          test development process can begin.

Step 2 -- Test specifications are needed to clarify the test's
          purposes, desirable item formats, number of test items,
          instructions to item writers, etc.

Step 3 -- Items are prepared to measure objectives included in the
          test (or tests, if there are going to be parallel forms,
          or levels of a test varying in difficulty).

Step 4 -- Initial editing of items is completed by the individuals
          writing them.

Step 5 -- A systematic assessment of items prepared in steps 3 and
          4 is conducted to determine item validities. Essentially,
          the task is to determine the content validity of the
          test items.

Step 6 -- Based on the data from step 5, it is possible to do
          further item editing, and in some instances, discard
          items that do not adequately measure the objectives
          they were written to measure.

Step 7 -- The test (or tests) must be assembled.

Step 8 -- A method for setting standards to interpret examinee
          performance is selected and implemented.

Step 9 -- The test (or tests) must be administered.

Step 10 -- Data addressing reliability, validity, and norms
           should be collected and analyzed.

Step 11 -- A user's manual and a technical manual should be
           prepared.

Step 12 -- This step is included to reinforce the point that it
           is necessary, in an on-going way, to compile technical
           data on the test items and tests as they are used in
           different situations with different examinee populations.


 1. Preparation and/or Selection of Objectives

 2. Preparation of Test Specifications (for example, Specification
    of Item Formats, Appropriate Vocabulary, and Number of Test
    Items/Objective)

 3. Writing Test Items "Matched" to Objectives

 4. Editing Test Items

 5. Determining Content Validity of the Test Items

    a. Involvement of Content Specialists

    b. Collection of Student Response Data

 6. Additional Editing of Test Items

 7. Test Assembly

    a. Determination of Test Length

    b. Test Item Selection

    c. Preparation of Directions

    d. Layout and Test Booklet Preparation

    e. Preparation of Scoring Keys

    f. Preparation of Answer Sheets

 8. Setting Standards for Interpreting Examinee Information

 9. Test Administration

10. Collection of Reliability, Validity and Norms Information

11. Preparation of a User's Manual and a Technical Manual

12. Periodic Collection of Additional Technical Information


Figure 1. Steps for Developing and Validating Criterion-
          Referenced Test Scores (From Hambleton & Eignor,
          1979b).

Hambleton and Eignor (1979a, 1979b) and Popham (1978a) describe
in detail how to carry out the 12 steps in constructing tests to
describe the performance of individuals. Methods for constructing tests
for use in program evaluation studies are not nearly so well developed.
In the next two sections methods will be proposed for handling two of
the four steps, item selection and reliability assessment, which are
handled differently when building tests to describe the performance
of groups.



Approach to Item Selection 

Introduction 

When decisions are to be made concerning an entire educational
or social program, group information rather than individual information
is of primary concern. There are two very important types of group
information available when criterion-referenced tests are employed.
The first of these is the average domain score for the entire group on
each of the relevant objectives (and across the set of objectives of
interest). An examinee's domain score is his/her proportion-correct score
in the domain of items measuring the objective. An estimate of the
average domain score for a group on a particular objective not only
gives an excellent description of a group in terms of the specific
objective, but can be used to make comparisons over time, comparisons
to other groups, comparisons among objectives, or comparisons of the
group's performance to some desired standard of performance (possibly
set by the instructors of the program of study). For example, a
target may be set for a group of examinees to achieve an average domain score
of .70 on an objective, that is, a 70% average performance level on items
measuring the objective. It would be helpful after program
implementation to compare the average domain score of the group to the
chosen standard or target.

The second type of information available through the use of
criterion-referenced testing is the percentage of people in a program
who are classified as masters on any given objective. An individual
is classified as a master or non-master by comparing the individual's domain
score estimate to a cut-off score positioned on the domain score scale.
It is thus helpful to know what percentage of those taking part in a
given program can be classified as masters. Again, comparisons over
time, comparisons between groups, comparisons among objectives, or
comparisons of the group to some standard are extremely useful in the
evaluation of effectiveness of program implementation.
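
The two group statistics described above can be illustrated with a
minimal Python sketch. The response data, the function names, the
cut-off score of .80, and the rule that a domain score estimate at or
above the cut-off counts as mastery are all assumptions chosen for the
illustration; only the .70 target for the group average is taken from
the example in the text.

```python
def domain_score(item_scores):
    """Proportion-correct score on the items measuring one objective."""
    return sum(item_scores) / len(item_scores)


def summarize_group(group_item_scores, cutoff):
    """Return (average domain score, percent of masters) for one objective.

    group_item_scores: one list of 0/1 item scores per examinee.
    cutoff: cut-off score on the domain score scale (assumed rule:
            an estimate at or above the cut-off is classified as mastery).
    """
    scores = [domain_score(s) for s in group_item_scores]
    average = sum(scores) / len(scores)
    masters = sum(1 for s in scores if s >= cutoff)
    return average, 100.0 * masters / len(scores)


# Hypothetical data: four examinees, five items on one objective.
group = [
    [1, 1, 1, 1, 0],  # domain score estimate 0.8
    [1, 1, 0, 0, 0],  # 0.4
    [1, 1, 1, 1, 1],  # 1.0
    [1, 1, 1, 0, 0],  # 0.6
]
average, percent_masters = summarize_group(group, cutoff=0.80)
target = 0.70  # target average domain score from the example in the text
print(f"average domain score: {average:.2f} (target {target:.2f})")
print(f"percent of masters:   {percent_masters:.0f}%")
```

The same two summaries could be computed for each objective, for each
group, or at each testing occasion to support the comparisons mentioned
above.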

Besides average domain scores and percent of masters on each or
perhaps only the most important program objectives, program evaluators
usually have an interest also in the variability and distribution of
domain scores, and in the percent of examinees in a group mastering
a specified number of objectives at a specified level of performance.

It should be noted, however, that in order to gather the
types of information described above, each student or sample of
students should be tested by several items for each objective. Testing
time can quickly become prohibitive. It would not be unusual, for
example, to have 100 objectives, each tested with 10 items, resulting
in a total of 1000 items, far too many to reasonably administer to any
group of people. It is possible, however, to utilize sampling plans
in order to gather information more efficiently.

The simplest sampling technique is to choose a random, or
stratified random, sample of examinees from the examinee population
and administer the entire test to the sample. This is known as
examinee-sampling. Although this procedure reduces the total amount
of testing, each individual that is selected may still be tested to an
unreasonable extent. An improvement on this is another sampling
procedure referred to as item-sampling. Here, items are randomly selected
(or perhaps stratified on difficulty level) from the domain of items
measuring each objective and administered to all examinees. This is
actually the situation that occurs in criterion-referenced measurement.
A representative set of items is selected from the domain of items
measuring an objective in order to make inferences about the entire domain.
Unfortunately, since the number of objectives to be measured by a test is
often large, the number of test items measuring any single objective is
likely to be quite small and therefore adequate domain coverage of an
objective is difficult to ensure.

Fortunately, it is possible for the evaluator to provide an
accurate description of the program with respect to the given objectives
while administering only a fraction of the total number of items to
any given individual. This procedure consists of the simultaneous
application of the two previously mentioned sampling procedures and
is called item-examinee or matrix sampling. A randomly selected group
of items is administered to a randomly selected group of examinees.
A further refinement, which results in better estimates of population
parameters, is referred to as multiple matrix sampling. In this case,
the item-examinee sampling procedure is repeated a number of times.
A first set of randomly selected items is assigned to a first group
of randomly selected examinees, followed by the assignment of a second
set of items to a second group of examinees, and so on. Estimates of
parameters of interest are calculated for each matrix and then pooled,
resulting in estimates that can be used to make inferences about all
examinees on all items. Considerable research has demonstrated the
feasibility, desirability, and efficiency of matrix and multiple matrix
sampling procedures (Shoemaker, 1973a; Sirotnik, 1974).
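
A rough Python sketch of multiple matrix sampling for a single objective
follows. It is an illustration under stated assumptions rather than the
estimation procedure of the cited authors: the simulated population, the
function name and its parameters, the restriction to equal-sized
matrices, and pooling by a simple average of the matrix means are all
choices made for this example.

```python
import random


def multiple_matrix_estimate(population, n_tests, items_per_test,
                             examinees_per_test, seed=None):
    """Estimate the mean proportion-correct score by multiple matrix sampling.

    population: list of examinees, each a list of 0/1 scores on every
                item in the domain (normally unknown; simulated here).
    Items and examinees are both drawn without replacement, so no item
    and no examinee appears in more than one item-examinee matrix.
    """
    rng = random.Random(seed)
    n_examinees, n_items = len(population), len(population[0])
    if n_tests * items_per_test > n_items:
        raise ValueError("plan uses more items than the domain contains")
    if n_tests * examinees_per_test > n_examinees:
        raise ValueError("plan uses more examinees than are available")

    items = rng.sample(range(n_items), n_tests * items_per_test)
    examinees = rng.sample(range(n_examinees), n_tests * examinees_per_test)

    matrix_means = []
    for m in range(n_tests):
        item_set = items[m * items_per_test:(m + 1) * items_per_test]
        group = examinees[m * examinees_per_test:(m + 1) * examinees_per_test]
        correct = sum(population[e][i] for e in group for i in item_set)
        matrix_means.append(correct / (items_per_test * examinees_per_test))

    # Equal-sized matrices, so the pooled estimate is a simple average.
    pooled = sum(matrix_means) / n_tests
    return pooled, matrix_means


# Simulated population: 300 examinees, a 40-item domain for one objective.
sim = random.Random(7)
population = [[1 if sim.random() < 0.7 else 0 for _ in range(40)]
              for _ in range(300)]
pooled, matrix_means = multiple_matrix_estimate(
    population, n_tests=4, items_per_test=10, examinees_per_test=50, seed=3)
true_mean = sum(map(sum, population)) / (300 * 40)
print(f"pooled estimate: {pooled:.3f}, population value: {true_mean:.3f}")
```

In this plan each sampled examinee answers 10 of the 40 items, yet every
item in the domain contributes to the pooled estimate.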

There are several item-examinee sampling designs that program
evaluators may find particularly useful for applications involving
criterion-referenced testing. Next, some practical considerations
in choosing a design will be discussed, followed by a presentation
of specific designs.
Some Preliminary Considerations

There are many practical aspects to be considered in choosing an
efficient sampling plan. (A sampling plan describes the number of
different tests that will be constructed, the number of items in the
tests, and the number of examinees who will be administered each test.)
Total testing time, the number of objectives,
the number of items per objective, and the number of examinees must
all be considered in light of the desired degrees of precision for the
statistics of interest. The amount of time allotted for testing is very often
restricted, if not by conditions intrinsic to the program itself, then by the
length of time one can expect an examinee to respond to test items. The
number of objectives to be tested must also be considered and decisions
made as to whether or not it is critical that each and every objective
be tested.

In some situations, it is more important to have more reliable
information on a subset of objectives rather than less reliable
information on all objectives. This may be particularly true when it is
of interest to report the percent of examinees who are classified as
masters on a given objective. In order to classify an examinee as a
master or non-master reliably, several items must be used for a given
objective. Since this may result in an unreasonably long test, it
may be necessary to establish priorities for the objectives, and measure
most completely only those objectives basic to the purposes of the
program. This may be accomplished particularly if the objectives are
structured hierarchically, that is, mastery of one objective is a
prerequisite to mastery of others. Priorities can be established to
reflect this.
In contrast, it may be more important in some situations to
report information on all objectives of a program. Since the
number of objectives is often large, obtaining information on each
objective and at the same time maintaining a test of reasonable
length may not be feasible. This problem can be overcome through
the use of multiple matrix sampling.

In designing a sampling plan, since the number of examinees, items,
objectives, and items per objective have a direct bearing on the
precision of estimates, the evaluator often must arrive at a compromise,
sometimes sacrificing precision in order to arrive at a feasible
test plan. Other aspects that need to be taken into account relate to
the nature of the objectives and test items of interest. Objectives
tested through use of items that require special directions, practice
questions, or verbal presentation can have an effect on the development
of a sampling plan. In these cases it is an inefficient use of
time to sample a very small number of items from the objective. It
is perhaps more reasonable to test each examinee group with fewer
objectives and more items per objective. The complexity of the domain also
has an effect on the number of items selected to measure an objective
and the method of item selection. More items must be used with complex
domains to ensure item representativeness. Stratification of the
items may be necessary to ensure complete, yet efficient, coverage of
the item domain.
Selection of Designs

In this section of the paper a few designs are presented that
are particularly suited for use with criterion-referenced testing in
program evaluation. The notation and definitions used here will be in
keeping with that suggested by Shoemaker and Knapp (1974). The number
of items in the domain of items measuring an objective is denoted by
K, the number of items measuring an objective in a test by k, the total
number of examinees in the population is denoted by N, and the number
of examinees taking each test by n. A particular sampling plan for
the collection of test data in relation to an objective, then, can
be represented as t/k/n, where t is the number of tests.

Multiple matrix sampling plans can be with or without replacement
on both the item and examinee dimensions. In evaluation settings,
it is important, when sampling examinees, to choose a given examinee
only once. This ensures maximum coverage of examinees and reduces
testing time on the part of an examinee while avoiding confounding
effects due to an examinee taking more than one test. Similarly,
sampling of items without replacement is important to ensure domain
coverage and avoid overlapping tests. Thus, it is apparent that
in evaluation settings, sampling of items and examinees without
replacement is the most meaningful and feasible sampling plan to
consider. For the purposes of this paper, we shall therefore assume
that all sampling is without replacement.
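As an illustration only, a t/k/n plan drawn without replacement on both dimensions can be sketched in Python as follows. The function name, the seed argument, and the domain size K = 200 are hypothetical choices made for this sketch; they do not come from the paper.

```python
import random


def draw_matrix_samples(N, K, t, k, n, seed=0):
    """Draw t non-overlapping (items, examinees) pairs for a t/k/n plan.

    Items are indexed 0..K-1 and examinees 0..N-1 (an assumption of
    this sketch). Sampling is without replacement on both dimensions,
    so no item and no examinee appears in more than one test.
    """
    if t * k > K or t * n > N:
        raise ValueError("need t*k <= K and t*n <= N without replacement")
    rng = random.Random(seed)
    items = rng.sample(range(K), t * k)        # distinct items
    examinees = rng.sample(range(N), t * n)    # distinct examinees
    return [
        (items[l * k:(l + 1) * k], examinees[l * n:(l + 1) * n])
        for l in range(t)
    ]


# A 4/5/250 plan for N = 1000 examinees and a hypothetical domain of K = 200 items.
plan = draw_matrix_samples(N=1000, K=200, t=4, k=5, n=250)
for test_number, (items, examinees) in enumerate(plan, start=1):
    print(test_number, sorted(items), len(examinees))
```

Because t x n equals N in this call, every examinee is tested exactly once, while only 20 of the 200 items are used.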

If each of the K items of an item domain is assigned to at
least one test, the sampling is said to be exhaustive in the item
dimension. Likewise, examinee sampling may be referred to as either
exhaustive or non-exhaustive depending on whether or not the entire
group of N examinees is tested. The choice of sampling is largely
dependent on the type of inferences the evaluator wishes to draw from
the resulting data. In the particular application of criterion-referenced
measurement to the evaluation of programs, the inference to be made
from the item dimension is different from that in a typical item sampling
plan. Ordinarily, item sampling is used to estimate a group's performance
on a fixed length test (K items) by looking at performance on tests
of length k. The important point here is that the inference is made to
some particular set of K items. In criterion-referenced measurement,
however, the inference of interest is not to some fixed set of items
but to a well-defined but very large domain of test items. Consequently,
items are in effect randomly chosen from the well-defined domain and used
to estimate examinees' success on the domain of interest (Sirotnik, 1974).
It is clear, then, that since the inference is to be made to the entire
domain from a sample of items, item sampling is non-exhaustive when
criterion-referenced interpretations are to be made of the scores. It
should also be noted that, in the evaluation of a program, information
about many domains is often required. The multiple matrix sampling must
occur within each domain, since generalizations are to be made to each
domain of interest.

The sampling of examinees can be one of three types: exhaustive,
non-exhaustive from a finite population, and non-exhaustive from an
infinite population. An example of an exhaustive sampling plan is
when every person in a program is tested on some subset of items keyed
to an objective. For example, a population of 1000 examinees is divided
into four subgroups of 250 examinees each, and each group is administered
5 items randomly chosen from a domain. Each examinee receives
5 items and the information from this is used to make an inference
about those 1000 examinees on the entire domain. The second type of
examinee sampling comes about when sampling is done from a fixed
population of examinees and not all those in the population are tested.
For example, suppose there are 100 objectives to be tested on a group
of 1000 examinees. The population of 1000 examinees can be divided
into two random samples, each sample of examinees responding to items
representing one half of the total objectives. Although 500 examinees
in each case are tested on an objective or domain, the inference is
to be made to the original population of 1000 examinees. Within a
given objective, examinee sampling is non-exhaustive. This design is
particularly applicable when information on many objectives is collected
simultaneously, since each objective is tested on only a
sample of the population. An obvious extension of the above is
non-exhaustive sampling from an infinite population. This design is
appropriate whenever the size of the examinee sample is small in
relation to the size of the population. A major advantage of this
plan is that it simplifies statistical computation. Schematic
representations of several types of sampling plans considered so far are
presented in Figure 2.

As mentioned earlier, it is important, when choosing a design to
implement, to consider carefully the nature of the information needed.
Several types of information will be addressed next. These are: (1) the
mean and variance of domain scores on an objective, (2) the entire domain
score distribution on an objective, (3) percent of masters on an objective,
and (4) percent of examinees mastering a given percent of objectives.
Figure 2. Representation of several types of sampling plans.
[Figure not reproduced. Panel titles recoverable from the original:]

   Multiple Matrix Samples (3); Non-Exhaustive Examinee Sampling;
   Non-Exhaustive Item Sampling.

   Multiple Matrix Samples (2); Across Objectives (4); Exhaustive
   Examinee Sampling; Non-Exhaustive Item Sampling.

   Multiple Matrix Samples (2); Across Objectives (4); Non-Exhaustive
   Examinee Sampling; Non-Exhaustive Item Sampling.

Note: Since the test items that measure an objective are only a sample
of the items from the domain of items of interest, all sampling plans
will be non-exhaustive of the item domain.

R* = Remaining items (unused or unwritten) in the domain of items
measuring an objective.

(1) Estimation of the Mean and Variance of Domain Scores on an Objective

A parameter of considerable importance is the average domain score
for the population of examinees on an objective.

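The unbiased estimate of the average domain score, $\hat{\mu}$, is given by

$$\hat{\mu} = \frac{1}{t} \sum_{\ell=1}^{t} \bar{X}_{\ell}$$

where the average domain score for test $\ell$, $\bar{X}_{\ell}$, is given by

$$\bar{X}_{\ell} = \frac{1}{kn} \sum_{j=1}^{k} \sum_{i=1}^{n} X_{ij\ell}$$

The quantity $X_{ij\ell}$ is the score of the ith individual on the jth item
measuring the objective under study in the $\ell$th test.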
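The estimate just described (the mean, over the t tests, of each test's mean item score) can be sketched in Python as below. The data layout, a list of t examinee-by-item matrices of 0/1 scores, and the function name are assumptions made for this illustration.

```python
def average_domain_score(tests):
    """Estimate the average domain score from t item-examinee matrices.

    tests: list of t matrices; each matrix is a list of n rows
    (examinees), each row a list of k item scores (0 or 1).
    """
    test_means = []
    for scores in tests:
        n = len(scores)          # examinees taking this test
        k = len(scores[0])       # items in this test
        test_means.append(sum(sum(row) for row in scores) / (n * k))
    # Average the t per-test means.
    return sum(test_means) / len(test_means)


# Two tiny tests (t = 2), each with n = 2 examinees and k = 2 items.
example = [
    [[1, 1], [1, 0]],   # test 1: mean 0.75
    [[0, 1], [0, 0]],   # test 2: mean 0.25
]
print(average_domain_score(example))
```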

A convenient and intuitively appealing way to approach the estimation
of the examinee domain score variance was presented by Sirotnik
(1970). He rederived the formulae, presented earlier by Lord and Novick
(1968), using an examinee-by-item analysis of variance design.
Examinees and items are seen as random effects and the item and examinee
population can be viewed as either finite or infinite depending on the
design at hand. Through the usual analysis of variance procedure the
mean square due to examinees ($MS_E$), mean square due to items ($MS_I$) and
the residual or mean square due to interaction ($MS_{EI}$) can be calculated.
From these, variance components of interest can be obtained
as follows (for finite populations of examinees and items):
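$$\hat{\sigma}_E^2 = \frac{N-1}{Nk}\left[MS_E - \frac{K-k}{K}\,MS_{EI}\right],$$

$$\hat{\sigma}_I^2 = \frac{K-1}{Kn}\left[MS_I - \frac{N-n}{N}\,MS_{EI}\right],$$

and

$$\hat{\sigma}_{EI}^2 = \frac{(N-1)(K-1)}{NK}\,MS_{EI}.$$

The statistic $\hat{\sigma}_E^2$ is, then, the estimate of the population variance.
If the size of the population of items (K) approaches infinity,

$$\hat{\sigma}_E^2 = \frac{N-1}{N}\left[\frac{MS_E - MS_{EI}}{k}\right]$$

and if both the population of items and the population of examinees are
infinite, the variance is given as follows:

$$\hat{\sigma}_E^2 = \frac{MS_E - MS_{EI}}{k}$$

An estimate of domain score variance is obtained from each of the t tests and
the t values are averaged, resulting in a more stable estimate of the
population parameter.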
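The analysis of variance computation can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function names are hypothetical, and the variance formula used is the infinite-item-domain form, (MS_E - MS_EI)/k, with the factor (N-1)/N applied when the examinee population is finite, as read from the text above.

```python
def mean_squares(scores):
    """Two-way (examinee x item) mean squares for one n-by-k score matrix."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    ss_e = k * sum((m - grand) ** 2 for m in row_means)      # examinees
    ss_i = n * sum((m - grand) ** 2 for m in col_means)      # items
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_ei = ss_total - ss_e - ss_i                           # interaction
    return ss_e / (n - 1), ss_i / (k - 1), ss_ei / ((n - 1) * (k - 1))


def examinee_variance(ms_e, ms_ei, k, N=None):
    """Domain score variance for an infinite item domain.

    N=None treats the examinee population as infinite as well.
    """
    base = (ms_e - ms_ei) / k
    return base if N is None else (N - 1) / N * base


def pooled_examinee_variance(tests, N=None):
    """Average the per-test variance estimates over the t tests."""
    estimates = []
    for scores in tests:
        ms_e, _, ms_ei = mean_squares(scores)
        estimates.append(examinee_variance(ms_e, ms_ei, len(scores[0]), N))
    return sum(estimates) / len(estimates)


scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mean_squares(scores))
print(pooled_examinee_variance([scores, scores]))
```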

It is often of interest to compare the average domain score of
a group to some established standard. For example, an average domain
score of at least .80 may be required for "success" on a given objective.
A statistical comparison of the estimate of average domain score to
the standard would be helpful. Rather than comparing the estimate
of average domain score to a standard, it may be of interest to
compare two groups, for example, experimental and control groups.
Although it is possible to test several other hypotheses using estimates
obtained through multiple matrix procedures, the hypotheses discussed
here are

(a) $H_0: \mu = c$ versus $H_a: \mu \neq c$

and

(b) $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \neq \mu_2$



To test hypotheses concerning the parameter $\mu$, the estimate of the
standard error must be calculated. The analysis of variance formulation
can again be used to estimate the three variance components, $\hat{\sigma}_I^2$, variance
due to items, $\hat{\sigma}_E^2$, variance due to examinees, and $\hat{\sigma}_{EI}^2$, variance due to
item-examinee interaction, for each test. As was mentioned earlier,
these variance estimates are pooled across tests to yield a pooled
variance estimate. The standard error of estimate of the mean domain
score can then be expressed by:



ICN-n)(K-k) + nkCt-l)la|J 






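Examinee sampling that is non-exhaustive, yet from a finite population,
and item sampling that is non-exhaustive from an infinite population
result in

$$\hat{\sigma}_{\hat{\mu}}^2 = \frac{1}{tnk(N-1)}\left[k(N-nt)\hat{\sigma}_E^2 + n(N-1)\hat{\sigma}_I^2 + (N-n)\hat{\sigma}_{EI}^2\right] \qquad [4]$$

Finally, if both the number of examinees and items are allowed to
approach infinity, the expression simplifies to

$$\hat{\sigma}_{\hat{\mu}}^2 = \frac{1}{tnk}\left[k\hat{\sigma}_E^2 + n\hat{\sigma}_I^2 + \hat{\sigma}_{EI}^2\right] \qquad [5]$$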

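After choosing the correct standard error of the estimates, the test
statistic for hypothesis (b) can be calculated as follows:

$$z = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma}_{\hat{\mu}_1 - \hat{\mu}_2}}$$

The corresponding statistic for hypothesis (a) is $z = (\hat{\mu} - c)/\hat{\sigma}_{\hat{\mu}}$.
This quantity is approximately distributed normally. Hence, the
computed z value can be compared with the tabulated values and the
appropriate decision concerning the hypothesis can be made.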
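A small numerical sketch of this test follows. It assumes the standard error expression for a finite examinee population and an infinite item domain, namely [k(N-nt)var_E + n(N-1)var_I + (N-n)var_EI] / [tnk(N-1)] for the squared standard error. The function names and the variance component values are hypothetical, and the two-group statistic additionally assumes the two groups are sampled independently, which the paper does not state.

```python
import math


def se_mean_finite_examinees(var_e, var_i, var_ei, t, k, n, N):
    """Standard error of the estimated mean domain score.

    Assumes a finite examinee population of size N and an infinite
    item domain, with all sampling without replacement.
    """
    total = (k * (N - n * t) * var_e
             + n * (N - 1) * var_i
             + (N - n) * var_ei)
    return math.sqrt(total / (t * n * k * (N - 1)))


def z_against_standard(mean_hat, c, se):
    """Hypothesis (a): compare an estimated mean with a standard c."""
    return (mean_hat - c) / se


def z_two_groups(mean_1, se_1, mean_2, se_2):
    """Hypothesis (b), assuming independent groups (an assumption here)."""
    return (mean_1 - mean_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# Hypothetical pooled variance components for a 4/5/250 plan with N = 1000.
se = se_mean_finite_examinees(0.04, 0.01, 0.15, t=4, k=5, n=250, N=1000)
z = z_against_standard(0.85, 0.80, se)
print(round(se, 4), round(z, 2), abs(z) > 1.96)
```

With exhaustive examinee sampling (nt = N) the examinee term vanishes, so the item and interaction components determine the standard error.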

In practice, the sampling of examinees will usually be exhaustive
or non-exhaustive from a finite population and the sampling of items
will be non-exhaustive from an infinite pool of items, and therefore
Equations [3] and [4] will prove to be more useful than either
Equation [2] or [5]. Likewise, when comparing two groups, Equations
[7] and [8] will be applicable more often than Equation [6] or [9].
(2) Estimation of Domain Score Distribution on an Objective

Multiple matrix sampling procedures were introduced initially
to enable test constructors to obtain better test score norms (Lord, 1962).
By requiring schools to administer fewer test items it was felt that
more representative test norms could be obtained because fewer schools
would decline to participate in a norming study. This would
result in more representative samples of examinees to estimate
the distribution of test scores in the examinee population of
interest.

Although matrix sampling was developed primarily for purposes
of norm-referenced measurement, the evaluator who is using
criterion-referenced measurement may find the estimation of the entire
distribution to be valuable. There are times when describing group performance
on an objective by a mean and variance alone is insufficient. Information
about particular percentiles may be needed. For example, it may be of
interest to know the proportion of students who have domain scores
above a value of .80 on a particular objective.

Several approaches to the estimation of an entire distribution
have been investigated (Brandenburg & Forsyth, 1974a). Lord (1962)
presented a successful application of item sampling using the negative
hypergeometric distribution to estimate a test score distribution.
This procedure is relatively straightforward since the distribution
is fitted to the three parameters: mean, variance and number of items.
Further work with the negative hypergeometric distribution was conducted
by Shoemaker (1970). He systematically varied the number of
tests, number of items per test, and the number of examinees receiving
each test and studied the fit of the estimated distribution to the
actual distribution. Shoemaker concluded that for small numbers of
observations the fit is variable, but as the number of observations
increases beyond a certain point (1.23% of the norm data base in this
study) all procedures produce equivalent results.

Brandenburg and Forsyth (1974b) studied the use of multiple
matrix sampling to estimate the parameters of the negative hypergeometric
distribution. They compared the distribution to that obtained
through estimation of the parameters of the Pearson Type I distribution.
In order to specify a particular Pearson Type I curve, the first four
moments must be estimated. These parameters were estimated through
use of Lord's (1960) formulae. Brandenburg and Forsyth concluded that,
in general, the Pearson Type I model tended to yield the better fit
of the two models. Since the Pearson Type I procedure requires estimation
of the first four moments, more items are required per test in
order to get a stable estimate of the distribution. When the number
of items per test is relatively small, the negative hypergeometric
may be more appropriate since only two moments of the distribution need
to be estimated. More study is needed with regard to the effect of the
choice of the sampling design on the fit of the models to the actual
distribution. In particular, the study of the fit of the two models
to various skewed distributions is critical, since much criterion-
referenced test data seems to be either positively or negatively skewed.
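To make the two-moment fit concrete, the sketch below fits a negative hypergeometric score distribution by the method of moments and reads off the proportion of examinees at or above a cut score. It rests on an assumption not stated in the paper: that the negative hypergeometric form used for test scores coincides with what is now usually called the beta-binomial distribution, with parameters a and b. The function names and the numerical inputs (a 20-item test with estimated mean 14 and variance 16) are hypothetical.

```python
import math


def fit_negative_hypergeometric(mean, variance, K):
    """Method-of-moments fit from the mean, variance and test length K.

    Returns beta-binomial parameters (a, b); this parameterisation is
    an assumption of the sketch.
    """
    p = mean / K
    r = variance / (K * p * (1 - p))   # ratio to the binomial variance
    if not 1 < r < K:
        raise ValueError("variance outside the range this model can fit")
    s = (K - r) / (r - 1)              # s = a + b
    return p * s, (1 - p) * s


def score_probability(x, K, a, b):
    """Probability of a number-correct score x on a K-item test."""
    log_choose = (math.lgamma(K + 1) - math.lgamma(x + 1)
                  - math.lgamma(K - x + 1))
    log_num = (math.lgamma(x + a) + math.lgamma(K - x + b)
               - math.lgamma(K + a + b))
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_num - log_den)


def proportion_at_or_above(cut, K, a, b):
    """Estimated proportion of examinees with proportion-correct >= cut."""
    first = math.ceil(cut * K - 1e-9)
    return sum(score_probability(x, K, a, b) for x in range(first, K + 1))


a, b = fit_negative_hypergeometric(mean=14.0, variance=16.0, K=20)
print(round(proportion_at_or_above(0.80, 20, a, b), 3))
```

In a matrix sampling application the mean and variance passed to the fit would themselves be the estimates obtained from the t tests.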
(3) Estimation of Percent of Masters on an Objective

One of the major purposes of criterion-referenced measurement
is to provide a mastery/non-mastery decision for a given individual.
In program evaluation, however, information on a given individual is
not critical. Reliable group information is what is needed to make
program decisions. For example, if 85% of the population served by
a particular program achieved mastery status, the program must be
accomplishing something. Whether or not 85% is an adequate
level of mastery must be concluded by comparing the value to some
previously established standard.

If every person in the entire population is tested with enough
items to make a reliable mastery decision, the percent of masters
can be obtained by simply calculating the percent of students
classified as masters. Then, the obtained percent can be directly
compared to some standard set by the program designers. If, however,
it is impossible to test all examinees on all objectives with enough
items on each to make reliable mastery decisions, it is necessary
to do some careful sampling.

One solution is to carry out examinee sampling on each objective,
make reliable mastery decisions on the chosen sample of examinees by
using a sufficient number of test items, and use the proportion of
masters in the examinee sample to estimate the proportion of masters
in the entire population.
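The examinee-sampling solution can be illustrated with a short sketch. It uses the simple binomial approximation sqrt(p(1-p)/n) for the standard error of the sample proportion and ignores any finite population correction; both are simplifying assumptions of this example, as are the function names and the made-up scores.

```python
import math


def percent_masters(domain_scores, cutoff):
    """Proportion of sampled examinees at or above the cut-off score,
    with an approximate (binomial) standard error."""
    n = len(domain_scores)
    p = sum(1 for score in domain_scores if score >= cutoff) / n
    return p, math.sqrt(p * (1 - p) / n)


def z_vs_program_standard(p, se, standard):
    """Compare the estimated proportion of masters with a preset standard."""
    return (p - standard) / se


# Hypothetical sample: 170 examinees scoring .90 and 30 scoring .60.
sample = [0.90] * 170 + [0.60] * 30
p, se = percent_masters(sample, cutoff=0.80)
print(p, round(se, 4), round(z_vs_program_standard(p, se, 0.75), 2))
```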

The multiple matrix sampling plans presented earlier can apply
in this situation, as long as enough items on given objectives are
administered to individuals. But there are some special considerations
when the variable of interest is the percent of examinees in the population
who have achieved some minimum level of performance on an objective.
When the number of items administered to a student does not
allow for setting a performance standard equal to the one which
applies to the domain of items measuring an objective, the resulting
percent estimate will be biased. For example, suppose the performance
standard is .80 and two items are administered per objective. There
are only three possible cut-off scores: 0, .50, and 1.00. If 1.00
is selected, some examinees who can meet the .80 standard will be
assigned to a non-mastery state and therefore the estimate of the
percent of masters will be too low. On the other hand, if 0 or .50
is selected, some examinees who could not meet the .80 standard will
be assigned to a mastery state and therefore the estimate of the percent
of masters will be too high. Clearly, if the "actual" and the "true"
cut-off scores differ, biased results (in a known direction) will be
obtained and the seriousness of the bias will be related to the
difference between the two cut-off scores. The implication of this is
clear: sample examinees, and administer each examinee a sufficient
number of items to enable the cut-off score on the sample of test items
to equal the desired cut-off score in the pool of test items measuring
the objective (e.g., if the true cut-off score is .75, the number of
test items administered must be a multiple of 4 so that the cut-off
score set on the sample of items can also be set equal to the value
.75). Assuming the amount of test data to be collected is fixed in
estimating the percent of masters of an objective in a population,
it is not clear whether it would be better to use (1) short tests
and many examinees or (2) longer tests and fewer examinees.
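The arithmetic behind the cut-off score example can be checked with a few lines of Python. The function names are hypothetical; exact fractions are used so that a standard such as .75 or .80 is compared without rounding error.

```python
from fractions import Fraction


def attainable_cutoffs(k):
    """All proportion-correct cut-off scores possible on a k-item test."""
    return [Fraction(x, k) for x in range(k + 1)]


def lengths_matching_standard(standard, max_k=20):
    """Test lengths (up to max_k) on which the standard is an attainable
    cut-off score, so that no bias arises from a mismatched cut-off."""
    s = Fraction(str(standard))
    return [k for k in range(1, max_k + 1) if (s * k).denominator == 1]


print(attainable_cutoffs(2))              # cut-offs on a two-item test
print(lengths_matching_standard(0.75))    # multiples of 4
print(lengths_matching_standard(0.80))    # multiples of 5
```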

It is also possible to approach the problem of estimating the
percent of examinees exceeding some standard of performance (students
defined as "masters" of the objective) through the use of procedures
presented in the previous section. Rather than getting reliable
mastery decisions on a sample of examinees and inferring the true
percent of masters, it is possible to use multiple matrix sampling
to estimate the entire score distribution and infer the percentage of
examinees that lie above a given minimum level of performance. It
remains to be seen which of the two procedures described results in the
most efficient and yet accurate results.

Hypotheses concerning the percent of students reaching mastery
parallel those presented in the previous section. The first relates
to the comparison of the estimated percent of masters to some
pre-established standard. A second hypothesis of interest concerns a
comparison of percent of masters across different objectives. Finally,
there may be interest in a comparison of percent of masters across
two or more groups.

(4) Estimation of the Percent of Examinees Mastering a
Given Percent of Objectives

It is often of interest to represent the success of a group on
an entire set of objectives. For example, statements such as the
following can be extremely descriptive of the success of a program:
"Eighty percent of the group mastered at least seventy-five percent
of the objectives." To do this efficiently, it is possible to present
samples of examinees with item samples selected from a representative
subset of objectives. Inferences are drawn to all examinees,
to the entire item domain and finally, to the entire set of objectives.
This procedure does, however, hinge on the "representativeness" of the
subset of objectives.



Reliability of Group Scores

Discussions of various approaches to reliability of criterion-
referenced measurement are readily available in the literature (for
example, see Hambleton, Swaminathan, Algina, & Coulson, 1978). The
emphasis in the work to date, however, has been on the reliability of
individual test scores and associated decisions. There are ample
methods and guidelines to aid the practitioner in estimating the
reliability of domain score estimates and mastery decisions. Since
group information is of most interest to program evaluators, the
reliability of the group statistics (average domain score and percent of
masters, for example) is of concern, rather than reliability of
individual scores. Reliability, then, is the accuracy of estimation of
the group-derived estimates. In the estimation of domain scores, the
accuracy is expressed in terms of the standard error of estimation
presented in Equations [3] through [6]. In the estimation of the proportion of
masters (P), the degree of precision is given (approximately)
by the formula $\sqrt{P(1-P)/n}$.
The variables that affect the accuracy are the number of tests (t),
number of examinees per group (n) and the number of items per test (k).

Suppose the content component of a program is comprised of
50 well-defined objectives. Let us also suppose the program
is serving 5000 students (N = 5000). It is appropriate at this time for
the evaluator to fix the desired accuracy of estimates and choose
values of t and n that will result in standard errors of estimate
less than some desired value. As suggested by Shoemaker (1971a), a
possible procedure is to place, in the equation for the standard
error of the estimate, an acceptable value for the standard error;
the equation can then be solved for t, the number of tests. The
difficulty with this procedure is that initial estimates of $\sigma_E^2$,
$\sigma_I^2$, and $\sigma_{EI}^2$ must be substituted in the expression. Rough estimates,
however, could be obtained through pilot testing, or from norms
studies. According to the guidelines presented by Shoemaker (1973),
the total number of observations, the product tkn, is the most
important variable to consider when attempting to achieve a particular
level of accuracy. As the total number of observations increases,
the size of the standard error of estimate correspondingly decreases.
Another point presented by Shoemaker that is particularly important in
this application relates to the distributional nature of the data.
"For normal normative distributions, increases in the number of items
per test are most effective in reducing standard errors of estimate;
for negatively-skewed distributions or positively-skewed distributions,
increases in the number of tests are most effective." Since
criterion-referenced test data tend to be skewed, the most effective way
of decreasing the standard error of estimate is to increase t, the
number of tests. After deciding upon values of t, k and n, it may
be the case that when all objectives are considered, testing time
becomes prohibitive. For example, suppose the following values are
decided upon: t = 10, k = 5, and n = 500. If all objectives are tested
in a like manner, each student must respond to a total of 250 items (50
objectives x 5 items/objective). It may be necessary in this case to
reduce k, or to administer fewer objectives to each individual. There
is no unique solution to the choice of t, k, and n. The choices very
often depend on practical considerations.

Conclusion

Program evaluators often find that it is important to evaluate
programs with respect to the goals and objectives of the program and
consequently they turn to criterion-referenced measurement. Criterion-
referenced test scores can provide both descriptive and normative
information. To date, criterion-referenced test technology has been
mainly directed toward information concerning individuals. In this
paper, technical considerations associated with item selection and
reliability assessment in relation to criterion-referenced tests
constructed to provide group information were discussed. Hopefully,
some of the ideas expressed in this paper will help to shape the
technology for building tests and evaluating test scores in program
evaluation studies.